Case task for iGaming development company:
Analysis of players' activity
Performed by Ilya Smolenskiy
As a part of this case task we'll be studying players' activity data of a game called Xxxxxxxx (an NFT-based B2B iGaming solution) in order to:
- identify and describe errors in the dataset;
- prepare a report evaluating the game's overall performance;
- define and justify player engagement metrics;
- propose additional variables for a more comprehensive study aimed at improving player engagement and performance.
Data preprocessing and overview - Identification and description of errors
In this section we are going to import, preprocess and get familiar with the received data. First of all, we need to import the libraries and the data we'll be using in our research:
# Importing libraries:
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.io as pio
import matplotlib.pyplot as plt
from datetime import datetime, date, time
from IPython.display import SVG, display, display_html
We may see the first 5 rows of our dataset and a description of its columns below:
# Importing the data:
data = pd.read_csv(r'data.csv')
# Sorting and resetting index:
data = data.sort_values('CreatedAt', ascending = True)\
.reset_index(drop = True)
# Displaying first 5 rows of a dataset:
data.head()
| | CreatedAt | Currency | UserId | PlatformId | MatchId | stake_currency | stake_eur | won_currency | won_eur |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023-02-25 00:00:02.304718058 | KZT | xjcxesupvw | platform_1 | 910457 | 88.16 | 0.19 | 94.33 | 0.20 |
| 1 | 2023-02-25 00:01:21.304718058 | AZN | yxbh215ay8 | platform_3 | 910460 | 0.70 | 0.39 | 0.00 | 0.00 |
| 2 | 2023-02-25 00:01:37.304718058 | KZT | l4400aezgm | platform_1 | 910461 | 138.51 | 0.29 | 274.24 | 0.58 |
| 3 | 2023-02-25 00:03:45.304718058 | AZN | yxbh215ay8 | platform_3 | 910467 | 1.23 | 0.69 | 1.62 | 0.90 |
| 4 | 2023-02-25 00:05:38.304718058 | INR | ch6bz7nbfx | platform_2 | 910473 | 18.72 | 0.22 | 0.00 | 0.00 |
Fields' description:
- CreatedAt - date and time of result recording
- Currency - player's currency name
- UserId - id of a player
- PlatformId - name of the platform
- MatchId - id of the game
- stake_currency - bet size in player's currency
- stake_eur - bet size in EUR
- won_currency - winning size in player's currency
- won_eur - winning size in EUR
Now let's make sure that all the variables are stored in proper type:
# Converting variables to more optimal types:
data['CreatedAt'] = pd.to_datetime(data['CreatedAt'])\
.astype('datetime64[s]')
data['CreatedAtDate'] = pd.to_datetime(data['CreatedAt'].dt.date)\
.astype('datetime64[s]')
data['CreatedAtTime'] = data['CreatedAt'].dt.time
# Getting PlatformId lighter in terms of memory usage:
data['PlatformId'] = pd.to_numeric(data['PlatformId'].str.replace("platform_", ""))
# Reordering columns by grouping datetime together:
data = data[['CreatedAt',
'CreatedAtDate',
'CreatedAtTime',
'Currency',
'UserId',
'PlatformId',
'MatchId',
'stake_currency',
'won_currency',
'stake_eur',
'won_eur']]
# Stating types of variables:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99989 entries, 0 to 99988
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   CreatedAt       99989 non-null  datetime64[s]
 1   CreatedAtDate   99989 non-null  datetime64[s]
 2   CreatedAtTime   99989 non-null  object
 3   Currency        99989 non-null  object
 4   UserId          99989 non-null  object
 5   PlatformId      99989 non-null  int64
 6   MatchId         99989 non-null  int64
 7   stake_currency  99989 non-null  float64
 8   won_currency    99989 non-null  float64
 9   stake_eur       99989 non-null  float64
 10  won_eur         99989 non-null  float64
dtypes: datetime64[s](2), float64(4), int64(2), object(3)
memory usage: 8.4+ MB
From here we may see that all variable types are set properly. We may also see that our data frame contains no missing values (NAs). Let's also check for duplicates:
# Searching for duplicates:
print("Absolute duplicates:", len(data[data.duplicated() == True]),
"\nDuplicates, excluding datetime:", len(data[data.loc[:, 'Currency':'won_eur'].duplicated() == True]))
Absolute duplicates: 0
Duplicates, excluding datetime: 0
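One pandas pitfall worth keeping in mind when combining boolean masks (as in the zero-stake check below): the bitwise `|` operator binds tighter than `==`, so each comparison must be wrapped in its own parentheses. A minimal sketch:

```python
import pandas as pd

# Without parentheses around the second comparison, an expression like
# `mask | s == 0` groups as `(mask | s) == 0`, which is not the intended
# combination of two boolean masks.
s = pd.Series([0, 1, 2])
mask = (s == 0) | (s == 2)   # parentheses make the intent explicit
print(mask.tolist())         # [True, False, True]
```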
We may see no duplicates in the dataset, let's briefly check for zero amount stakes:
data[(data['stake_currency'] == 0) | (data['stake_eur'] == 0)].sort_values('stake_currency', ascending = False)
| | CreatedAt | CreatedAtDate | CreatedAtTime | Currency | UserId | PlatformId | MatchId | stake_currency | won_currency | stake_eur | won_eur |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 81348 | 2023-03-30 12:45:54 | 2023-03-30 | 12:45:54 | KZT | fgiedi7blr | 2 | 1057211 | 2.50 | 0.00 | 0.0 | 0.00 |
| 87781 | 2023-04-02 06:08:16 | 2023-04-02 | 06:08:16 | KZT | l4ci1wzefy | 1 | 1069176 | 2.48 | 0.00 | 0.0 | 0.00 |
| 56658 | 2023-03-19 16:25:12 | 2023-03-19 | 16:25:12 | KZT | q1zujlrz7u | 1 | 1009410 | 2.48 | 0.00 | 0.0 | 0.00 |
| 96983 | 2023-04-05 19:55:06 | 2023-04-05 | 19:55:06 | KZT | 8zsac2z4o0 | 1 | 1084929 | 2.46 | 2.66 | 0.0 | 0.01 |
| 89270 | 2023-04-02 18:03:50 | 2023-04-02 | 18:03:50 | KZT | ig83bc8dc9 | 1 | 1071367 | 2.46 | 5.26 | 0.0 | 0.01 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 54957 | 2023-03-18 23:39:45 | 2023-03-18 | 23:39:45 | TRY | glj8cicicc | 3 | 1006359 | 0.01 | 0.00 | 0.0 | 0.00 |
| 23578 | 2023-03-06 15:56:24 | 2023-03-06 | 15:56:24 | AZN | r5ah0cnpb1 | 3 | 952567 | 0.01 | 0.00 | 0.0 | 0.00 |
| 18471 | 2023-03-04 10:03:04 | 2023-03-04 | 10:03:04 | AZN | mnvqswj6lv | 3 | 942693 | 0.01 | 0.00 | 0.0 | 0.00 |
| 79281 | 2023-03-29 16:51:47 | 2023-03-29 | 16:51:47 | AZN | 95pj2grx5u | 2 | 1053578 | 0.01 | 0.00 | 0.0 | 0.00 |
| 59348 | 2023-03-20 18:54:03 | 2023-03-20 | 18:54:03 | AZN | lotrfy3sg5 | 2 | 1014335 | 0.01 | 0.00 | 0.0 | 0.00 |
260 rows × 11 columns
In the table above we may see games with zero-amount stakes, mostly in euros. However, some of these cases seem reasonable: the stake in the local currency is simply too small to display as anything other than zero in euros at the selected rounding precision, so let's keep these cases in the dataset.
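The rounding argument can be illustrated with a quick sketch (the exchange rate here is a hypothetical illustration, not taken from the dataset):

```python
# A stake whose EUR value falls below half of the rounding step
# (0.005 for two decimal places) is displayed as 0.00 EUR.
try_to_eur = 0.05            # hypothetical TRY -> EUR rate, for illustration
stake_try = 0.01             # smallest local-currency stake in the table above
stake_eur = round(stake_try * try_to_eur, 2)
print(stake_eur)             # 0.0
```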
Let's check for zero stakes in stake_currency only:
data[(data['stake_currency'] == 0)].sort_values('stake_currency', ascending = False)
| | CreatedAt | CreatedAtDate | CreatedAtTime | Currency | UserId | PlatformId | MatchId | stake_currency | won_currency | stake_eur | won_eur |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 33749 | 2023-03-10 19:18:40 | 2023-03-10 | 19:18:40 | AZN | 0gdqrqabxe | 3 | 970695 | 0.0 | 0.00 | 0.0 | 0.0 |
| 85465 | 2023-04-01 07:08:16 | 2023-04-01 | 07:08:16 | AZN | pra6xdwjfs | 3 | 1064955 | 0.0 | 0.01 | 0.0 | 0.0 |
These entries are obviously wrong: the stakes in the local currency equal zero, and, moreover, the last row records a win without a stake. Let's drop these entries from our dataset and move on to an overall summary of the datetime and categorical variables:
# Dropping incorrect entries from the table with resetting table's index:
data = data.drop(index = [33749, 85465]).reset_index(drop = True)
# Preparing general summary
print("Period of observation:",
"from", data['CreatedAt'].dt.date.min(),
"to", data['CreatedAt'].dt.date.max(),
"(",(data['CreatedAt'].max() - data['CreatedAt'].min()), ")",
"\nUnique players:", len(data['UserId'].unique()),
"\nNumber of sessions:", len(data['MatchId'].unique()),
"\nNumber of platforms:", len(data['PlatformId'].unique()),
"\nCurrencies/zones:", len(data['Currency'].unique())
)
Period of observation: from 2023-02-25 to 2023-04-06 ( 40 days 23:37:04 )
Unique players: 23436
Number of sessions: 73035
Number of platforms: 4
Currencies/zones: 4
Here we may see that our data frame contains almost 41 days of historical data on the game's players' activity, and that the data can be grouped by 4 currencies and 4 platforms.
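As a quick sanity check on the quoted period, the date arithmetic can be reproduced directly:

```python
from datetime import date

# The observation window printed above: 2023-02-25 to 2023-04-06.
span = date(2023, 4, 6) - date(2023, 2, 25)
print(span.days)   # 40 full days, i.e. almost 41 days of activity
```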
Now let's look at a summary of the numeric variables. But first, let's add a characterizing variable stating the financial result / gaming revenue of Xxxxxxxx for each player's game, for each of the <...>_currency and <...>_eur groups of variables:
# Financial result / Gaming revenue of Xxxxxxxx for every game in player's currency and euro:
data['result_currency'] = data['stake_currency'] - data['won_currency']
data['result_eur'] = data['stake_eur'] - data['won_eur']
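On a couple of synthetic rows (values are illustrative, mirroring the sample rows shown earlier), the sign convention works out as follows: a positive result means the operator kept money on that game, a negative one means the player won more than they staked.

```python
import pandas as pd

# Synthetic rows for illustration only.
sample = pd.DataFrame({'stake_eur': [0.19, 0.39],
                       'won_eur':   [0.20, 0.00]})
sample['result_eur'] = sample['stake_eur'] - sample['won_eur']
print(sample['result_eur'].round(2).tolist())   # [-0.01, 0.39]
```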
Now let's see the summary. To make the aggregate analytics valid, we'll slice it by player's currency:
# Building summaries for each currency and total in EUR:
data_currency = data[['Currency',
'stake_currency',
'won_currency',
'result_currency',
'stake_eur',
'won_eur',
'result_eur']]
caption = [dict(selector="caption",
props=[("text-align", "Left"),
("font-size", "150%"),
("font-weight", "bold")])]
for currency in data['Currency'].unique():
data_currency_description = data_currency[data_currency['Currency'] == currency].describe()
data_currency_description.loc['Sum of variables in the dataframe'] = data_currency[data_currency['Currency'] == currency]\
.sum(numeric_only=True, axis=0)
data_currency_description = data_currency_description.map("{0:.2f}".format)
data_currency_description = data_currency_description.style\
.set_table_styles(caption)\
.set_table_attributes("style='display:inline'")\
.set_caption(currency)
display_html(data_currency_description._repr_html_()+'<br>'*2, raw=True)
data_currency_description = data_currency[['stake_eur', 'won_eur', 'result_eur']].describe()
data_currency_description.loc['Sum of variables in the dataframe'] = data_currency.sum(numeric_only=True, axis=0)
data_currency_description = data_currency_description.map("{0:.2f}".format)
data_currency_description = data_currency_description.style\
.set_table_styles(caption)\
.set_table_attributes("style='display:inline'")\
.set_caption("Total in EUR")
display_html(data_currency_description._repr_html_(), raw=True)
**KZT**

| | stake_currency | won_currency | result_currency | stake_eur | won_eur | result_eur |
|---|---|---|---|---|---|---|
| count | 57032.00 | 57032.00 | 57032.00 | 57032.00 | 57032.00 | 57032.00 |
| mean | 542.74 | 526.98 | 15.76 | 1.13 | 1.09 | 0.03 |
| std | 3425.52 | 6626.88 | 5585.52 | 7.14 | 13.85 | 11.67 |
| min | 0.17 | 0.00 | -558851.84 | 0.00 | 0.00 | -1218.80 |
| 25% | 38.05 | 0.00 | -40.12 | 0.08 | 0.00 | -0.08 |
| 50% | 71.95 | 0.00 | 33.53 | 0.15 | 0.00 | 0.07 |
| 75% | 249.77 | 93.80 | 106.37 | 0.52 | 0.19 | 0.22 |
| max | 150146.69 | 673841.93 | 150146.69 | 319.58 | 1469.58 | 316.78 |
| Sum of variables in the dataframe | 30953285.50 | 30054586.81 | 898698.69 | 64185.41 | 62349.52 | 1835.89 |
**AZN**

| | stake_currency | won_currency | result_currency | stake_eur | won_eur | result_eur |
|---|---|---|---|---|---|---|
| count | 10741.00 | 10741.00 | 10741.00 | 10741.00 | 10741.00 | 10741.00 |
| mean | 4.19 | 3.88 | 0.31 | 2.30 | 2.13 | 0.17 |
| std | 17.09 | 27.52 | 25.19 | 9.37 | 15.09 | 13.82 |
| min | 0.01 | 0.00 | -974.87 | 0.00 | 0.00 | -541.39 |
| 25% | 0.27 | 0.00 | -0.30 | 0.15 | 0.00 | -0.17 |
| 50% | 0.79 | 0.00 | 0.24 | 0.43 | 0.00 | 0.13 |
| 75% | 2.37 | 0.77 | 1.16 | 1.30 | 0.42 | 0.64 |
| max | 587.92 | 1126.74 | 587.92 | 320.05 | 615.68 | 320.05 |
| Sum of variables in the dataframe | 44960.50 | 41642.29 | 3318.21 | 24674.55 | 22858.42 | 1816.13 |
**INR**

| | stake_currency | won_currency | result_currency | stake_eur | won_eur | result_eur |
|---|---|---|---|---|---|---|
| count | 14905.00 | 14905.00 | 14905.00 | 14905.00 | 14905.00 | 14905.00 |
| mean | 147.26 | 120.53 | 26.73 | 1.67 | 1.37 | 0.30 |
| std | 730.93 | 853.84 | 818.62 | 8.27 | 9.66 | 9.24 |
| min | 0.02 | 0.00 | -38906.13 | 0.00 | 0.00 | -435.45 |
| 25% | 8.67 | 0.00 | -11.84 | 0.10 | 0.00 | -0.14 |
| 50% | 21.82 | 0.00 | 6.54 | 0.25 | 0.00 | 0.07 |
| 75% | 88.87 | 28.65 | 31.36 | 1.01 | 0.32 | 0.35 |
| max | 21144.64 | 60050.77 | 20698.29 | 236.66 | 672.11 | 233.14 |
| Sum of variables in the dataframe | 2194959.01 | 1796487.62 | 398471.39 | 24849.15 | 20360.48 | 4488.67 |
**TRY**

| | stake_currency | won_currency | result_currency | stake_eur | won_eur | result_eur |
|---|---|---|---|---|---|---|
| count | 17309.00 | 17309.00 | 17309.00 | 17309.00 | 17309.00 | 17309.00 |
| mean | 54.64 | 51.56 | 3.07 | 2.67 | 2.52 | 0.15 |
| std | 215.80 | 391.35 | 349.71 | 10.44 | 18.92 | 16.89 |
| min | 0.01 | 0.00 | -20052.07 | 0.00 | 0.00 | -962.71 |
| 25% | 2.71 | 0.00 | -4.27 | 0.13 | 0.00 | -0.21 |
| 50% | 9.31 | 0.00 | 2.06 | 0.46 | 0.00 | 0.10 |
| 75% | 31.77 | 10.17 | 14.01 | 1.56 | 0.50 | 0.68 |
| max | 4680.67 | 21943.77 | 4680.67 | 224.63 | 1053.53 | 224.63 |
| Sum of variables in the dataframe | 945682.35 | 892534.90 | 53147.45 | 46133.69 | 43540.18 | 2593.51 |
**Total in EUR**

| | stake_eur | won_eur | result_eur |
|---|---|---|---|
| count | 99987.00 | 99987.00 | 99987.00 |
| mean | 1.60 | 1.49 | 0.11 |
| std | 8.24 | 14.49 | 12.66 |
| min | 0.00 | 0.00 | -1218.80 |
| 25% | 0.09 | 0.00 | -0.11 |
| 50% | 0.21 | 0.00 | 0.08 |
| 75% | 0.82 | 0.25 | 0.33 |
| max | 320.05 | 1469.58 | 320.05 |
| Sum of variables in the dataframe | 159842.80 | 149108.60 | 10734.20 |
From here we may see that, on average, Xxxxxxxx earns about 11 euro cents on every game.
We may also see that the max values of stake_eur and won_eur in the totals are significantly higher than the 3rd quartile, and that the mean is always higher than the median (2nd quartile). Both facts indicate that our sample contains outliers of large stakes/winnings and that the distributions of those variables are right-/positively skewed. It is therefore more accurate to use the median to characterize typical values, so it's more accurate to say that Xxxxxxxx earns about 8 euro cents on a typical game.
Studying the gaming revenue result_eur, we note that it has the opposite, left/negative skewness, because it is the difference of the previously named variables.
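The skewness reading can also be checked numerically; a minimal sketch on synthetic data shaped like the one described (many small values with a long right tail):

```python
import numpy as np
import pandas as pd

# Log-normal draws mimic stakes: mostly small, with rare large outliers.
rng = np.random.default_rng(0)
stakes = pd.Series(rng.lognormal(mean=-1.5, sigma=1.2, size=10_000))

print(stakes.skew() > 0)     # right/positive skew: mean pulled above median
print((-stakes).skew() < 0)  # the mirrored series is left/negatively skewed
```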
The behavior of the metrics in the local currencies is similar. Let's support our findings by building a boxplot for the totals in euro:
# Building boxplot chart:
_eur_fig = px.box(data[['stake_eur',
'won_eur',
'result_eur']])
_eur_fig.update_layout(title_text="Stakes, winning amounts and gaming revenue in euro",
autosize=True,
)
display(SVG(pio.to_image(_eur_fig, format="svg",engine='orca', width=1100, height=500)))